Some words to start with

Welcome to the page! This is an essential part of my final assignment for the course “Introduction to Open Data Science”, or, as we friends call it, the “IODS” course. My name is Laura Matkala and I am a PhD student who studies forests. I have to say this is one of the most inspiring courses I have taken in a while. I will do my best to use all the new skills I have learned during the course to get the best possible outcome for this assignment!

Happy forest scientist by a lake in a mountain forest at Mammoth Lakes, CA, USA. (This is here to remind us that although it doesn’t look like it now, the sun actually does exist…)

About the dataset

I chose to use the dataset Boston, which includes data about housing in the suburbs of Boston, Massachusetts, USA. I will later perform linear regression and logistic regression on the variable “crime”, but first some basic information about the dataset.

The dataset has variables related to housing in the suburbs of Boston, Massachusetts, USA. Picture from: http://amtrakdowneaster.com/stations/boston

I standardized the dataset beforehand and explored it a bit. You can find the R script file with all the data wrangling code here. The variables in the dataset are zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv and crime.

Analysis

Linear regression

To start with the linear regression, I need to read in the data and load the needed packages. I will also check the structure and dimensions of the data to see that everything is in order after the data wrangling and saving the file as CSV.

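# stringsAsFactors = TRUE keeps "crime" a factor on R >= 4.0, as in the output below
Boston <- read.csv(file = "C:/HY-Data/MATKALA/GitHub/IODS-final/Boston.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)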
library(GGally); library(ggplot2)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "high","low","med_high",..: 2 2 2 2 2 2 4 4 4 4 ...
dim(Boston)
## [1] 506  14
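The standardized values and the four-level “crime” factor in the output above are consistent with a wrangling step roughly like the one below. This is only a sketch, not the actual script: it assumes the data started out as the `Boston` data frame from the MASS package, that every column was standardized with `scale()`, and that the crime rate was cut at its quartiles. The object names `boston_scaled`, `bins` and `out_file`, as well as the output path, are made up for illustration.

```r
library(MASS)  # ships with R; assumed source of the original Boston data

# Standardize every column to mean 0 and standard deviation 1
boston_scaled <- as.data.frame(scale(MASS::Boston))

# Assumption: the crime rate is cut at its quartiles into four classes
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# Swap the numeric crime rate for the categorical version
boston_scaled$crim <- NULL
boston_scaled$crime <- crime

# Hypothetical output location; the real script writes to its own path
out_file <- file.path(tempdir(), "Boston.csv")
write.csv(boston_scaled, out_file, row.names = FALSE)
```

Reading a file written this way back in with `read.csv()` would store the class labels as text, and converting them to a factor orders the levels alphabetically, which would explain why “high” is listed first in the `str()` output.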

Everything seems to be in order, and the data looks the way I meant it to at this point. Let’s plot the variables against each other to see what the data looks like.

ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))

I will create a multiple linear regression model that uses “rad”, “dis” and “ptratio” as explanatory variables for “crime”.
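A model like this could be fitted along the following lines. Note that `lm()` expects a numeric response while “crime” is a four-level factor, so this sketch assumes the classes are first recoded as ordered scores from 1 (low) to 4 (high). That recoding, the column name `crime_score`, and rebuilding the data from `MASS::Boston` instead of reading the CSV file are all assumptions made to keep the example self-contained; the analysis itself may handle the response differently.

```r
library(MASS)  # ships with R; used here only to rebuild the data

# Rebuild a standardized data set with a four-class crime variable (assumed wrangling)
boston_scaled <- as.data.frame(scale(MASS::Boston))
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# Assumption: treat the ordered classes as numeric scores 1..4
boston_scaled$crime_score <- as.integer(crime)

# Multiple linear regression with three explanatory variables
fit <- lm(crime_score ~ rad + dis + ptratio, data = boston_scaled)
summary(fit)
```

Treating class labels as evenly spaced numbers is a strong simplification, so the coefficients from such a fit are best read as rough indications of direction and strength.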